{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Generating incorrect answer suggestions\n",
    "Using word embeddings we're going to find the most similar words to an answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Importing the word embeddings\n",
    "Unfortunately our beloved *spacy* does not offer most similar words. We'll use **gensim** for that.\n",
    "\n",
    "Make sure you download the embeddings file first. Instructions in the *README* in the **data** folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gensim\n",
    "from gensim.test.utils import datapath, get_tmpfile\n",
    "from gensim.models import KeyedVectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "glove_file = '../data/embeddings/glove.6B.300d.txt'\n",
    "tmp_file = '../data/embeddings/word2vec-glove.6B.300d.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.scripts.glove2word2vec import glove2word2vec\n",
    "glove2word2vec(glove_file, tmp_file)\n",
    "model = KeyedVectors.load_word2vec_format(tmp_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Similar words examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('probo', 0.5426342487335205),\n",
       " ('koalas', 0.4729689359664917),\n",
       " ('orangutan', 0.4557289779186249),\n",
       " ('grizzly', 0.41816502809524536),\n",
       " ('marsupial', 0.39361128211021423),\n",
       " ('wombat', 0.3832378685474396),\n",
       " ('cuddly', 0.3804110288619995),\n",
       " ('kodiak', 0.37843799591064453),\n",
       " ('kade', 0.37742379307746887),\n",
       " ('kangaroo', 0.3612629175186157)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['koala'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems to be working fine. Though what *the f* * is a probo?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image.png](https://i.gyazo.com/8e982abd6da0025cb985b388c07507a8.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point we asume that we have our answer, the sentence it's in, the entire text, and the title. Let's explore some words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*__Oxygen__ is a chemical element with symbol O and atomic number 8.*  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('hydrogen', 0.63267982006073),\n",
       " ('nitrogen', 0.6251459717750549),\n",
       " ('helium', 0.5435217022895813),\n",
       " ('nutrients', 0.5369840860366821),\n",
       " ('breathing', 0.5023170709609985),\n",
       " ('chlorine', 0.4946938157081604),\n",
       " ('monoxide', 0.4911428987979889),\n",
       " ('dioxide', 0.4911195933818817),\n",
       " ('ammonia', 0.49079084396362305),\n",
       " ('carbon', 0.4836854636669159)]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['oxygen'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That was easy. Let's try something more difficult.\n",
    "\n",
    "*the oldest portuguese university was first established in **lisbon** before moving to coimbra.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('portugal', 0.6408252716064453),\n",
       " ('porto', 0.5835250616073608),\n",
       " ('benfica', 0.5504175424575806),\n",
       " ('copenhagen', 0.5288481712341309),\n",
       " ('portuguese', 0.5266897678375244),\n",
       " ('madrid', 0.5219067335128784),\n",
       " ('brussels', 0.5173484683036804),\n",
       " ('oporto', 0.5147969126701355),\n",
       " ('prague', 0.5037161707878113),\n",
       " ('amsterdam', 0.5018222332000732)]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['lisbon'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Seems like we are getting closer to *football teams* rather than *cities with old universities*. Let's add some more words from the sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('faculty', 0.5288037061691284),\n",
       " ('college', 0.523701012134552),\n",
       " ('professor', 0.5193326473236084),\n",
       " ('graduate', 0.5135288834571838),\n",
       " ('universities', 0.5098860859870911),\n",
       " ('copenhagen', 0.5022274255752563),\n",
       " ('campus', 0.4942850172519684),\n",
       " ('prague', 0.4880773425102234),\n",
       " ('madrid', 0.4852182865142822),\n",
       " ('portugal', 0.4788099527359009)]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['lisbon', 'university'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! But now the words are getting too close to university. It would be good if we can add more weight to the orignal answer.\n",
    "\n",
    "I can manually do it by taking the best embeddings to the original answer and counting how many times they occur in the joint embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('porto', 0.6089159846305847),\n",
       " ('portugal', 0.6070287823677063),\n",
       " ('oporto', 0.5988742113113403),\n",
       " ('braga', 0.5796492099761963),\n",
       " ('benfica', 0.5514551401138306),\n",
       " ('leiria', 0.5170067548751831),\n",
       " ('aveiro', 0.4983532428741455),\n",
       " ('viseu', 0.491713285446167),\n",
       " ('évora', 0.4914955198764801),\n",
       " ('são', 0.4868907928466797)]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['lisbon', 'coimbra'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using another city really makes a difference and shows some good candidates. I think it'll be a good idea to use words in the sentence that are next to the answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Words with the same stem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('writing', 0.6969849467277527),\n",
       " ('read', 0.6291235089302063),\n",
       " ('wrote', 0.6251993179321289),\n",
       " ('written', 0.6065735816955566),\n",
       " ('publish', 0.5670630931854248),\n",
       " (\"'d\", 0.5343195796012878),\n",
       " ('writes', 0.5341792702674866),\n",
       " ('tell', 0.5337096452713013),\n",
       " ('you', 0.5316603779792786),\n",
       " ('books', 0.5285096168518066)]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['write'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could just remove all similar words that have the same stem as the original answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additionally, the incorrect answers should be the same part of speech. Like with **write** - *read*, *publish*, *tell* are good candidates, but *books* could be easily discarded for being a noun."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Numbers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('1943', 0.9581360220909119),\n",
       " ('1942', 0.9418259859085083),\n",
       " ('1941', 0.9256348609924316),\n",
       " ('1940', 0.8975383043289185),\n",
       " ('1945', 0.8817087411880493),\n",
       " ('1939', 0.8315708637237549),\n",
       " ('1946', 0.8234671950340271),\n",
       " ('1938', 0.781980574131012),\n",
       " ('1937', 0.7764101028442383),\n",
       " ('1935', 0.7516504526138306)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['1944'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not that bad. They seem to gravitate around the events of WW2. It seems better than ramdon numbers or closest numbers if we need to have multiple answer question. But I think it may be a better question if you have to input the number yourself, and you get a better score if you are closer to the correct answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('clinton', 0.7889922261238098),\n",
       " ('obama', 0.7570987939834595),\n",
       " ('gore', 0.6871949434280396),\n",
       " ('w.', 0.6750580072402954),\n",
       " ('cheney', 0.6621242761611938),\n",
       " ('mccain', 0.6613168716430664),\n",
       " ('barack', 0.6568867564201355),\n",
       " ('administration', 0.6468127965927124),\n",
       " ('george', 0.6463572978973389),\n",
       " ('kerry', 0.6004412174224854)]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['bush'], topn=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('postulate', 0.4412064254283905),\n",
       " ('archimedes', 0.43941453099250793),\n",
       " ('n.e.', 0.39649108052253723),\n",
       " ('pythagoras', 0.39116495847702026),\n",
       " ('aristotle', 0.3895653486251831),\n",
       " ('avenue', 0.38695406913757324),\n",
       " ('proclus', 0.3855825662612915),\n",
       " ('greektown', 0.3836863040924072),\n",
       " ('ptolemy', 0.38028305768966675),\n",
       " ('berea', 0.37123364210128784)]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['euclid'], topn=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('atanas', 0.6365466713905334),\n",
       " ('fery', 0.4410214424133301),\n",
       " ('simeonov', 0.4386071562767029),\n",
       " ('atanassov', 0.4376071095466614),\n",
       " ('mladenov', 0.4347333312034607),\n",
       " ('sergeevich', 0.4314761757850647),\n",
       " ('neophytos', 0.4266960620880127),\n",
       " ('geleta', 0.419179230928421),\n",
       " ('vassilev', 0.41890764236450195),\n",
       " ('stoev', 0.414333313703537)]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.most_similar(positive=['atanasov'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I expected to be a lot worse. Names of famous people gets us other names of people with the same profesion - US presidents and greek mathematicians come up pretty easily. \n",
    "\n",
    "But with some less known figures, like a general in a certain battle, it woulnd't work. In those cases it would be good if we find other names in the same text or if we're working with a textbook we can use the names from other topics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll keep it simple. We just need the *count* amount of distractors (incorrect answers)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_distractors(answer, count):\n",
    "    answer = str.lower(answer)\n",
    "    \n",
    "    ##Extracting closest words for the answer. \n",
    "    try:\n",
    "        closestWords = model.most_similar(positive=[answer], topn=count)\n",
    "    except:\n",
    "        #In case the word is not in the vocabulary, or other problem not loading embeddings\n",
    "        return []\n",
    "\n",
    "    #Return count many distractors\n",
    "    distractors = list(map(lambda x: x[0], closestWords))[0:count]\n",
    "    \n",
    "    return distractors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hydrogen', 'nitrogen', 'helium', 'nutrients']"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_distractors('oxygen', 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['romania', 'hungary', 'ukraine', 'slovakia', 'bulgarian', 'macedonia']"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_distractors('bulgaria', 6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
